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WHY  PATTERN  SEARCH  WORKS* 

ROBERT  MICHAEL  LEWIS*,  VIRGINIA  TORCZON*,  AND  MICHAEL  W.  TROSSET§ 


Abstract.  Pattern  search  methods  are  a  class  of  direct  search  methods  for  nonlinear  optimization. 
Since  the  introduction  of  the  original  pattern  search  methods  in  the  late  1950s  and  early  1960s,  they  have 
remained  popular  with  users  due  to  their  simplicity  and  the  fact  that  they  work  well  in  practice  on  a  variety 
of  problems.  More  recently,  the  fact  that  they  are  provably  convergent  has  generated  renewed  interest  in  the 
nonlinear  programming  community.  The  purpose  of  this  article  is  to  describe  what  pattern  search  methods 
are  and  why  they  work. 
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1.  Introduction.  Pattern  search  methods  are  a  class  of  direct  search  methods  for  nonlinear  optimiza¬ 
tion.  Since  the  introduction  of  the  original  pattern  search  methods  in  the  late  1950s  and  early  1960s  [2,  5], 
they  have  remained  popular  with  users  due  to  their  simplicity  and  the  fact  that  they  work  well  in  practice 
on  a  variety  of  problems.  More  recently,  the  fact  that  they  are  provably  convergent  has  generated  renewed 
interest  in  the  nonlinear  programming  community. 

The  purpose  of  this  article  is  to  describe  what  pattern  search  methods  are  and  why  they  work.  Much 
of  our  past  work  on  pattern  search  methods  was  guided  by  a  desire  to  unify  a  variety  of  existing  algorithms 
and  provide  them  with  a  common  convergence  theory.  Unfortunately  the  unification  of  this  broad  class 
of  algorithms  requires  a  technical  framework  that  obscures  the  features  that  distinguish  pattern  search 
algorithms  and  make  them  work.  We  hope  here  to  give  a  clearer  explanation  of  these  ideas.  Space  does  not 
allow  us  to  do  justice  to  the  history  of  these  methods  and  all  the  work  relating  to  them;  this  will  be  the 
subject  of  a  lengthier  review  elsewhere;  for  a  historical  perspective,  see  [17]. 

2.  A  Simple  Example  of  Pattern  Search.  We  begin  our  discussion  with  a  simple  instance  of  a 
pattern  search  algorithm  for  unconstrained  minimization:  minimize  f(x).  At  iteration  k,  we  have  an  iterate 
Xk  £  IFC1  and  a  step-length  parameter  A*  >  0.  Let  e*,  i  =  1, . . .  ,n,  denote  the  standard  unit  basis  vectors. 
We  successively  look  at  the  points  x+  =  Xk  =b  A/^,  i  =  1, . . .  ,n,  until  we  find  x+  for  which  /(#+)  <  f(x &). 
Fig.  2.1  illustrates  the  set  of  points  among  which  we  search  for  £+  for  n  =  2.  This  set  of  points  is  an 
instance  of  what  we  call  a  pattern,  from  which  pattern  search  takes  its  name.  If  we  find  no  x+  such  that 
/(#+)  <  f(xk),  then  we  reduce  A*  by  a  half  and  continue;  otherwise,  we  leave  the  step-length  parameter 
alone,  setting  Ak+i  =  A&  and  Xk+i  =  x+.  In  the  latter  case  we  can  also  increase  the  step- length  parameter, 
say  by  a  factor  of  2,  if  we  feel  a  longer  step  might  be  justified.  We  repeat  the  iteration  just  described  until 
Afc  is  deemed  sufficiently  small. 
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Fia.  2.1.  A  simple  instance  of  pattern  search 


This  simple  example  illustrates  two  attractive  features  of  pattern  search  algorithms: 

•  They  can  be  extremely  simple  to  specify  and  implement. 

•  No  explicit  estimate  of  the  derivative  nor  anything  like  a  Taylor’s  series  appears  in  the  algorithm. 
This  makes  these  algorithms  useful  in  situations  where  derivatives  are  not  available  and  finite- 
difference  derivatives  are  unreliable,  such  as  when  f(x)  is  noisy. 

These  qualities  have  made  pattern  search  algorithms  popular  with  users.  Yet  despite  their  seeming  simplicity 
and  heuristic  nature  and  the  fact  that  they  do  not  have  explicit  recourse  to  the  derivatives  of  f(x),  pattern 
search  algorithms  possess  global  convergence  properties  that  are  almost  as  strong  as  those  of  comparable 
line-search  and  trust-region  algorithms.  In  this  article  we  will  attempt  to  explain  this  perhaps  surprising 
fact. 

Before  turning  to  the  discussion  of  how  this  can  be,  we  note  some  further  features  of  pattern  search 
which  are  manifest  in  this  simple  example. 

•  We  require  only  simple  decrease  in  f(x).  In  fact,  we  do  not  even  need  to  know  f(x)  as  a  numerical 
value,  provided  we  can  make  the  assessment  that  f(x+)  is  an  improvement  on  f(xk). 

•  If  we  are  lucky,  we  need  only  a  single  evaluation  of  f(x)  in  any  given  iteration.  Once  we  find  an  x+ 
for  which  f{x+)  <  f(Xk ),  we  can  accept  it  and  proceed.  On  the  other  hand,  in  the  worst  case  we 
will  look  in  quite  a  few  directions  (2 n,  for  this  example)  before  we  try  shorter  steps. 

•  The  steps  that  are  allowed  are  restricted  in  direction  and  length.  In  this  example,  the  steps  must  lie 
parallel  to  the  coordinate  axes  and  the  length  of  any  step  has  the  form  Ao/2^  for  some  integer  N. 

This  simple  example  also  suggests  that  there  is  a  great  deal  of  flexibility  in  pattern  search  algorithms, 
depending  on  how  one  specifies  the  pattern  of  points  to  be  searched  for  the  next  iterate.  These  features  will 
be  recurring  themes  in  our  discussion. 

3.  The  General  Pattern  Search  Algorithm.  For  simplicity,  our  discussion  will  focus  primarily  on 
the  case  of  unconstrained  minimization, 


minimize  f{x). 

We  assume  that  /  is  continuously  differentiable,  but  that  information  about  the  gradient  of  /  is  either 
unavailable  or  unreliable.  Since  the  inception  of  pattern  search  methods,  various  techniques  have  also  been 
used  to  apply  them  to  solve  the  general  nonlinear  programming  problem 

minimize  f(x) 

subject  to  c(x)  >  0 
t  <  X  <  u. 


More  recently,  pattern  search  methods  specifically  designed  for  constrained  problems  with  an  attendant 
convergence  theory  have  been  developed  in  [6,  9,  8]. 

The  form  of  a  general  pattern  search  algorithm  is  quite  simple  and  not  all  that  different  from  any  other 
nonlinear  minimization  algorithm:  first,  find  a  step  sk  from  the  current  iterate  xk ;  second,  determine  if  that 
step  is  acceptable;  and  finally,  update  the  critical  components  of  the  algorithm.  At  iteration  k  pattern  search 
methods  will  consider  steps  in  directions  denoted  by  dk .  We  require  dk  to  be  a  column  of  Dk ,  where  Dk  is 
an  nxpk  matrix  (i.e.,  Dk  represents  the  set  of  directions  under  consideration). 

Generalized  pattern  search: 

Given  xQ  £  IRn ,  f{x0),  D0  G  IRnXp\  and  A0  >  0, 
for  k  =  0, 1, . . .  until  done  do  { 

1.  Find  a  step  sk  =  Akdk  using  the  procedure  ExploratoryJYIoves(AA;,  Dk). 

2.  If  f(xk  +  A kdk)  <  f(xk),  then  xM  =  xk  +  Afe4;  otherwise,  xk+1  =  xk. 

3.  Update (Afe,  Dk) 

} 

In  order  to  establish  convergence  results  for  this  class  of  algorithms,  we  will,  by  and  by,  place  additional 
conditions  on  Dk ,  the  step  calculation  procedure  Explorat ory .Moves  () ,  and  the  update  procedure  Update  () . 
The  analysis  reveals  that  we  do  not  need  to  explicitly  define  Explor  at  ory  .Moves  ()  or  Update  ();  for  the 
purposes  of  ensuring  convergence  it  suffices  to  specify  conditions  on  the  results  they  produce.  We  refer  the 
interested  reader  to  [16]  for  specific  examples  of  Exploratory-Moves  ()  and  Update  ()  used  for  some  of  the 
more  traditional  pattern  search  methods. 

4.  Global  Convergence  Analysis.  Here  we  will  use  global  convergence  of  an  optimization  algorithm 
to  mean  convergence  to  a  stationary  point  of  at  least  one  subsequence  of  the  sequence  of  iterates  produced 
by  the  algorithm.  A  slightly  weaker  assertion  is 

lim inf  ||V/(xfc)||  =  0; 

k— *00 

this  is  equivalent  to  the  previous  property  if  the  iterates  {xk}  remain  in  a  bounded  set. 

Classical  analyses  of  such  methods  as  steepest  descent  and  globalized  Newton  methods  rely  in  a  fun¬ 
damental  way  on  V/(x)  to  prove  global  convergence.  Moreover,  the  technical  conditions  that  make  the 
proof  of  global  convergence  for  these  algorithms  possible,  such  as  the  Armijo-Goldstein- Wolfe  conditions  for 
line-search  methods,  are  actually  built  into  the  specification  of  gradient- based  algorithms. 

On  the  other  hand,  no  such  technical  conditions  appear  in  the  description  of  pattern  search  algorithms 
(witness  the  example  in  §2).  The  philosophy  of  pattern  search  algorithms  (and  direct  search  methods  in 
general)  is  best  described  by  Hooke  and  Jeeves  [5]: 

We  use  the  phrase  “direct  search”  to  describe  sequential  examination  of  trial  solutions 
involving  comparison  of  each  trial  solution  with  the  “best”  obtained  up  to  that  time  to¬ 
gether  with  a  strategy  for  determining  (as  a  function  of  earlier  results)  what  the  next  trial 
solution  will  be.  The  phrase  implies  our  preference,  based  on  experience,  for  straightforward 
search  strategies  which  employ  no  techniques  of  classical  analysis  except  where  there  is  a 
demonstrable  advantage  in  doing  so.1 

*It  might  strike  the  modern  reader  as  odd  that  Hooke  and  Jeeves  would  question  the  advantages  of  employing  techniques 
of  “classical  analysis” — meaning  calculus — given  the  success  of  quasi-Newton  algorithms.  However,  direct  search  methods 
appeared  in  the  late  1950s  and  early  1960s,  a  time  at  which  derivative-based  methods  were  not  as  efficient  as  today,  and  no 
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This  passage  captures  the  basic  philosophy  of  the  original  work  on  direct  search  algorithms:  an  avoidance 
of  the  explicit  use  or  approximation  of  derivatives.  Instead,  the  developers  of  the  original  direct  search 
algorithms  relied  on  heuristics  to  obtain  what  they  considered  promising  search  directions. 

Nonetheless,  we  can  prove  global  convergence  results  for  pattern  search  methods,  even  though  this  class  of 
algorithms  was  not  originally  developed  with  convergence  analysis  in  mind.  The  analysis  does  ultimately  rely 
on  V/(x);  hence  the  assumption  that  /  is  continuously  differentiable.  But  because  pattern  search  methods 
do  not  compute  or  approximate  V/( x),  the  relationship  between  these  algorithms  and  their  convergence 
analysis  is  less  direct  than  that  for  gradient-based  algorithms. 

4.1.  The  Ingredients  of  Global  Convergence  Analysis.  We  will  now  review  the  ideas  that  underlie 
the  global  convergence  analysis  of  line-search  methods  for  unconstrained  minimization  in  order  to  compare 
them  with  those  for  pattern  search.  We  focus  on  line-search  methods  rather  than  trust-region  methods  since 
the  comparisons  and  contrasts  with  pattern  search  are  simpler  for  line-search  methods. 

In  order  to  prove  global  convergence  of  a  line-search  algorithm,  at  the  very  least  one  must  show  that 
if  the  current  iterate  xk  is  not  a  stationary  point,  then  the  algorithm  will  eventually  find  an  iterate  Xk+ 1 
such  that  f{xk+ 1)  <  f{xk).  This  unavoidably  leads  to  the  contemplation  of  the  gradient,  since  the  gradient 
ensures  that  a  direction  of  descent  can  be  identified:  if  xk  is  not  a  stationary  point  of  /,  then  any  direction 
within  90°  of  -V/(xfe)  is  a  descent  direction.  For  our  purposes,  this  will  prove  a  crucial,  if  elementary, 
observation:  one  does  not  need  to  know  the  negative  gradient  in  order  to  improve  /(x),  one  only  needs  a 
direction  of  descent.  Then,  if  one  takes  a  short  enough  step  in  that  direction,  one  is  guaranteed  to  find  a 
point  xk+i  such  that  f{xk+ 1)  <  f{xk). 

However,  descent  is  not  sufficient  to  ensure  convergence:  one  must  also  rule  out  the  possibility  that  the 
algorithm  can  simply  grind  to  a  halt,  converging  to  a  point  that  is  not  a  stationary  point.  One  begins  by 
requiring  at  least  one  search  direction  to  be  uniformly  bounded  away  from  orthogonality  with  -V/(xfc). 
This  ensures  that  the  sequence  of  iterates  cannot  degenerate  into  steps  along  directions  that  become  ever 
more  orthogonal  to  the  gradient  while  producing  an  ever  diminishing  improvement  in  f(x). 

This  restriction  on  the  search  directions  is  still  not  sufficient  to  prevent  the  iterates  from  converging  to 
points  that  are  not  stationary  points.  This  unhappy  situation  can  occur  in  two  ways.  First,  there  is  the 
pathology  depicted  in  Fig.  4.1.  The  ellipse  represents  a  level  set  of  /(x),  which  in  this  case  is  a  convex 
quadratic.  The  steps  taken  are  too  long  relative  to  the  amount  of  decrease  between  successive  iterates. 
While  the  sequence  of  iterates  {xj,}  produces  a  strictly  decreasing  sequence  of  objective  values  {/(x&)},  the 


Fig.  4.1.  Decrease  is  too  small  relative  to  the  length  of  the  step 


general  convergence  analysis  existed  for  any  practical  optimization  algorithm,  derivative-based  or  not.  The  Armijo- Goldstcin- 
Wolfe  conditions  [1,  4,  19],  which  form  the  basis  for  designing  and  analyzing  what  we  now  consider  to  be  practical  line-search 
algorithms,  were  several  years  in  the  future;  trust  region  algorithms  [14]  were  further  still. 
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sequence  of  iterates  converges  to  two  nonstationary  points. 

The  other  pathology,  depicted  in  Fig.  4.2,  occurs  when  the  amount  of  decrease  between  successive  iterates 
is  too  small  relative  to  the  amount  of  decrease  initially  seen  in  the  direction  from  one  iterate  to  the  next. 
This  time  the  steps  between  successive  iterates  become  excessively  short.  This  sequence  converges  to  a  single 


Fig.  4.2.  Decrease  is  too  small  relative  to  the  norm  of  the  gradient 


point  which  again  is  not  a  stationary  point. 

These  pathologies  lead  to  the  second  standard  element  of  global  convergence  analysis:  a  mechanism  that 
controls  the  length  of  the  step.  Both  of  the  preceding  pathologies  can  be  avoided,  for  instance,  by  requiring 
that  the  amount  of  decrease  in  f(x)  between  successive  iterates  be  “sufficient,”  where  sufficient  relates  the 
amount  of  decrease,  the  length  of  the  step,  and  the  gradient  Vf(x).  This  is  the  purpose  of  the  Armijo- 
Goldstein- Wolfe  conditions  for  line-search  algorithms:  given  a  suitable  descent  direction  dk ,  we  choose  a  step 
length  Afc  >  0  such  that  for  some  fixed  a  €  (0, 1)  and  fixed  f3  €  (a,  1),  Xk+i  =  Xk  +  A kdk  satisfies  both 

(4.1)  f{xk+i)  <  f{xk)  +  aAkVf{xk)Tdk 
and 

(4.2)  V / {xk+\  )T d,k  >  /3V, / (xk )T dk . 

The  first  condition  precludes  steps  that  are  too  long;  the  second  condition  precludes  steps  that  are  too  short. 

5.  How  Pattern  Search  Does  Its  Thing.  We  can  summarize  the  devices  that  ensure  the  global 
convergence  of  line-search  methods  for  unconstrained  minimization  as  follows: 

1.  The  choice  of  a  suitably  good  descent  direction. 

2.  Step-length  control: 

(a)  a  mechanism  to  avoid  steps  that  are  too  long,  and 

(b)  a  mechanism  to  avoid  steps  that  are  too  short,  where  long  and  short  refer  to  the  sufficient 
decrease  conditions  (4.1)  and  (4.2),  respectively. 

These  mechanisms,  which  are  explicitly  built  into  line-search  algorithms,  all  depend  on  information  about 
the  gradient.  However,  pattern  search  algorithms  do  not  assume  such  information,  and  thus  do  not  and 
cannot  enforce  such  conditions.  What,  then,  ensures  the  global  convergence  of  pattern  search  algorithms? 

The  answer  resembles  the  classical  arguments  for  establishing  the  global  convergence  of  line-search 
methods,  but  necessarily  with  novel  elements.  As  we  shall  see,  pattern  search  algorithms  are  globally 
convergent  because: 

1.  At  each  iteration,  they  look  in  enough  directions  to  ensure  that  a  suitably  good  descent  direction 
will  ultimately  be  considered. 

2.  They  possess  a  reasonable  back-tracking  strategy  that  avoids  unnecessarily  short  steps. 
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Fig.  5.1.  Examples  of  a  minimal  and  a  maximal  positive  basis  for  R2 

3.  They  otherwise  avoid  unsuitable  steps  by  restricting  the  nature  of  the  step  allowed  between  successive 
iterates,  rather  than  by  placing  requirements  on  the  amount  of  decrease  realized  between  successive 
iterates. 

At  the  heart  of  the  argument  lies  an  unusual  twist:  we  relax  the  requirement  of  sufficient  decrease  and  require 
only  simple  decrease  (f{xk+i)  <  f(xk)),  but  we  impose  stronger  conditions  on  the  form  the  step  sk  may 
take.  Furthermore,  this  trade-off  is  more  than  just  a  theoretical  innovation:  in  practice,  it  permits  useful 
search  strategies  that  are  precluded  by  the  condition  of  sufficient  decrease. 

5.1.  Pattern  Search  as  a  Crypto- Gradient  Method.  The  analysis  begins  by  demonstrating  that 
a  search  direction  not  too  far  from  the  negative  gradient  is  always  available.  This  is  accomplished  by 
considering  a  set  of  step  directions  Dk  sufficiently  rich  that  it  necessarily  includes  at  least  one  acceptable 
descent  direction.  In  the  absence  of  any  estimate  of  -Vf(xk),  pattern  search  algorithms  hedge  against  the 
fact  that  —Vf(xk)  could  point  in  any  direction. 

For  the  example  in  §2  the  set  of  directions  Dk  is  {± e*,  %  =  1, . . .  ,n},  so  the  set  of  prospective  next 
iterates  has  the  simple  form  {xk  ±  A ka,  i  =  1, . . .  ,n}.  If  a  step  sk  =  ±AkeH  producing  simple  decrease  on 
f(xk)  is  found,  then  xk+i  =  xk  ±  A kei\  otherwise,  reduce  Ak  and  try  again.  Other  of  the  original  pattern 
search  methods,  such  as  the  method  of  Hooke  and  Jeeves  [5]  or  coordinate  search  [13],  also  include  in  Dk 
the  directions  {± e*,  i  —  1, . . .  ,n}. 

The  analysis  in  [16]  allows  for  more  general  conditions  on  the  set  of  directions.  In  particular,  Dk  must 
contain  a  set  of  the  form  {±Pi,  i  =  1, . . .  ,n},  where  pi, . . .  ,pn  is  any  linearly  independent  set  of  vectors. 
One  can  allow  this  set  to  vary  with  fc,  so  long  as  one  restricts  attention  to  a  finite  collection  of  such  sets. 

The  discussion  in  [18]  brought  to  our  attention  that  even  less  is  required:  it  suffices  that  the  set  of 
directions  Dk  contain  a  positive  basis  for  lRn  [7].  In  terms  of  the  theory  of  positive  linear  dependence 
[3],  the  positive  span  of  a  set  of  vectors  {a\ , . . . ,  ar}  is  the  cone 

{a  e  IRn  |  a  —  Ciai  +  . . .  +  crar ,  c*  >  0  for  all  i}. 

The  set  {ai, . . .  ,ar}  is  called  positively  dependent  if  one  of  the  a^s  is  a  nonnegative  combination  of  the 
others;  otherwise  the  set  is  positively  independent.  A  positive  basis  is  a  positively  independent  set  whose 
positive  span  is  IRn,  i.e.,  a  set  of  generators  for  a  cone  that  happens  to  be  a  vector  space.  A  positive  basis 
must  contain  at  least  n+1  vectors  and  can  contain  no  more  than  2 n  vectors  [3];  we  refer  to  the  former  as 
minimal  and  the  latter  as  maximal;  Fig.  5.1  demonstrates  examples  of  both  for  R2. 

How  do  we  know  that  at  least  one  of  the  directions  in  Dk  is  not  too  orthogonal  to  the  direction  of  steepest 
descent,  regardless  of  what  — V/(:r)  might  be?  A  proof  by  picture  is  given  in  Fig.  5.2;  see  [7]  for  details. 


Fig.  5.2.  A  minimal  positive  basis  for  R2  and  the  two  worst  cases  for  V /(:£&) 


Fig.  5.3.  Some  possible  patterns 


Consider  the  minimal  positive  basis  {(1, 1)T,  (1,  -T)t,  (— 1, 0)T}  depicted  Fig.  5.2  as  directions  emanating 
from  xk-  Notice  that  the  angles  between  these  vectors  are  90°,  135°,  and  135°.  For  any  continuously 
differentiable  function  /  :  TR 2  — ►  IR ,  if  xk  is  not  a  stationary  point,  then  -Vf(zk)  can  be  no  more  than 
67.5°  from  one  of  the  vectors  in  the  positive  basis,  as  shown  in  Fig.  5.2.  Thus,  including  a  positive  basis  Uk 
in  the  set  of  directions  Dk  guarantees  that  we  can  approximate  the  negative  gradient  in  a  way  that  cannot 
be  arbitrarily  bad.  This  is  the  first  step  towards  establishing  global  convergence. 

5.2.  The  Underlying  Lattice  Structure.  As  it  happens — as  it  was  meant  to  happen — pattern  search 
methods  are  restricted  in  the  nature  of  the  steps  they  take.  This  ultimately  turns  out  to  be  the  reason  pattern 
search  methods  can  avoid  the  pathologies  illustrated  in  Fig.  4.1  and  Fig.  4.2  without  enforcing  a  sufficient 
decrease  condition. 

Let  Pk  denote  the  set  of  candidates  for  the  next  iterate  (i.e.,  Pk  =  xk  +  A kDk,  by  abuse  of  notation). 
We  call  Pk  the  pattern ,  from  which  pattern  search  takes  its  name.  Several  traditional  patterns  are  depicted 
in  Fig.  5.3. 
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Fig.  5.4.  The  same  patterns  on  a  refinement  of  the  grid 


Though  it  does  not  appear  to  have  been  by  any  conscious  design  on  the  part  of  the  original  developers  of 
pattern  search  algorithms,  these  algorithms  produce  iterates  that  lie  on  a  finitely  generated  rational  lattice, 
as  indicated  in  Fig.  5.3.  More  precisely,*  there  exists  a  set  of  generators  #i, . . .  ,#m,  independent  of  k ,  such 
that  any  iterate  Xk  can  be  written  as 

m 

(5.1)  %k  =  Xq  "f" 

i—1 

where  each  c1-  is  rational. 

Note  that  this  structural  feature  means  that  the  set  of  possible  steps,  and  thus  the  set  of  possible 
iterates,  is  known  in  advance  and  is  independent  of  the  actual  objective  f(x).  This  is  in  obvious  contrast  to 
gradient-based  methods. 

Furthermore — and  this  is  significant  to  the  convergence  analysis  of  pattern  search — by  judicious  (but 
not  especially  restrictive)  choice  of  the  factors  by  which  can  be  increased  or  decreased,  we  can  establish 
the  following  behavior  of  these  algorithms.  Suppose  the  set  {  x  |  /( x)  <  f(x 0)  }  is  bounded.  Given  any 
A*  >  0,  there  exists  a  finite  subset  (that  depends  on  A*)  of  the  lattice  of  all  possible  iterates  such  that  x k 
must  belong  to  this  subset  until  A&  <  A*.  That  is,  there  is  only  a  finite  number  of  distinct  values  xk  can 
possibly  have  until  such  time  as  A &  <  A* .  This  means  that  the  only  way  to  obtain  an  infinite  sequence  of 
distinct  x k  is  to  reduce  the  step  length  parameter  A&  infinitely  often  so  that  lim  inf k^-oo  A&  =  0. 

This  also  reveals  another  role  played  by  the  parameter  A*.  Reducing  A &  increases  the  set  of  candidate 
iterates  by  allowing  us  to  search  over  a  finer  subset  of  the  rational  lattice  of  all  possible  iterates.  This  is 
shown  in  Fig.  5.3  and  Fig.  5.4.  In  these  pictures,  halving  Ak  refines  the  grid  over  which  we  are  tacitly 
searching  for  a  minimizer  of  f(x),  while  halving  the  minimum  length  a  step  is  allowed  to  have. 

5.3.  Putting  It  All  Together.  We  return  to  the  general  pattern  search  algorithm: 

Generalized  pattern  search: 


Given  x0  £  JRn ,  /(*0),  D0  £  IRnxv\  and  A0  >  0, 
for  k  =  0, 1, . . .  until  done  do  { 

1.  Find  a  step  sk  =  A kdk  using  the  procedure  Exploratory_Moves(Afc,  Z>^). 

2.  If  +  Afcdfe)  <  /(^fc),  then  a*+1  =  s*  +  Akdk\  otherwise,  xk+i  =  % k - 

3.  Update  (A*,  Dk) 

} 

The  step  sk  returned  by  the  Exploratory-Moves  0  algorithm  must  satisfy  two  simple  conditions: 

1.  The  step  returned  must  be  an  element  of  A kDk. 

2.  The  step  sk  must  satisfy  either  f(xk  +  sk)  <  f(xk)  or  sk  =  0. 

Furthermore,  sk  may  be  0  only  if  none  of  the  steps  in  Afe IIa;  yielded  decrease  on  f(xk). 

The  first  condition  prevents  arbitrary  steps  along  arbitrary  directions;  the  second  condition  is  a  back-tracking 
control  mechanism  that  prevents  us  from  taking  shorter  steps  unless  it  is  truly  necessary. 

As  for  the  procedure  Update  ()  for  Dk  and  Aky  we  are  free  to  make  modifications  to  Dk  before  the 
next  iteration.  Classical  pattern  search  methods  typically  specify  a  single  D  =  Dk  for  all  k.  Others  make 
substantive  changes  in  response  to  the  outcome  of  the  exploratory  moves.  This  is  just  one  of  many  options 
to  consider  when  designing  a  pattern  search  method,  and  it  leads  to  a  great  deal  of  flexibility  in  this  class 
of  algorithms.  There  are  conditions  that  must  be  satisfied  to  preserve  the  lattice  structure,  but  these  are 
straightforward  to  satisfy  in  practice.  The  interested  reader  is  referred  to  [7],  for  a  complete  discussion  of 
the  technical  conditions,  and  to  [16],  for  a  description  of  some  traditional  choices. 

The  rules  for  updating  Ak  are  also  restricted  by  the  need  to  preserve  the  algebraic  structure  of  the 
possible  iterates.  Historically,  the  popular  choices  have  been  to  halve  Ak  at  unsuccessful  iterations,  and 
to  either  leave  Ak  alone  at  successful  iterations  or  possibly  double  it.  The  convergence  analysis  leads  to 
other  possibilities:  we  can  rescale  Ak  by  0  £  {t"°,  . . .  ,7^},  where  r  is  a  rational  number,  {a;0, . . .  are 
integers,  L  >  2,  cj0  <  0  and  uL  >  0.  This  provides  at  least  one  option  for  reducing  Ak  when  back-tracking 
is  called  for,  and  at  least  one  option  that  does  not  reduce  Ak. 

The  proof  of  convergence  now  goes  like  this.  Suppose  xk  is  not  a  stationary  point  of  f(x).  Because  at 
least  one  of  the  directions  dk  in  Uk  is  necessarily  a  descent  direction,  we  can  always  find  an  acceptable  step 
A kdk  once  we  reduce  Ak  sufficiently.  Thus,  we  can  always  find  ®k+i  with  f(xk+ 1)  <  f{%k)  for  k  in  some 
subsequence  K . 

Now,  if  liminffe_oo  ||V/(a?fc)||  ^  0,  then  for  some  £  >  0,  ||V/(afe)||  >  e  for  all  k.  Under  this  assumption 
we  can  show  that  once  Ak  is  sufficiently  small  relative  to  e,  it  will  no  longer  be  reduced.  This  is  so  because 
one  of  the  directions  dk  in  Uk  is  sufficiently  close  to  —Vf(xk)  to  be  a  uniformly  good  descent  direction,  and 
||V/(fl*)||  is  uniformly  not  too  small,  so  we  will  have  f(xk  +  Akdk)  <  f(xk)  without  having  to  drive  Ak  to 
zero. 

However,  if  liminffc^oo  A*  =  A*  >  0,  then  due  to  the  lattice  structure  of  the  iterates,  there  can  be  only 
finitely  many  possible  xk,  contradicting  the  fact  that  we  have  an  infinite  subsequence  K  with  f(xk+1)  <  f(xk) 
for  all  k  £  K  (assuming  {  x  \  f(x)  <  f{x 0)  }  is  bounded).  Hence  liminffc_>oo  ||V/0e*)||  =  0. 

The  correlation  between  the  fineness  of  the  grid  of  possible  iterates  and  the  size  of  A&  also  explains  why 
long  steps  are  not  a  problem.  We  have  argued  above  that  if  liminffc_>oo  Ak  —  0,  then  liminf/^oo  ||  V/(xfc)||  = 
0.  Now,  unless  liminf/^oo  Ak  =  0,  there  can  be  only  a  finite  number  of  distinct  iterates,  and  hence  only 
a  finite  number  of  long  steps  (or  any  type  of  step,  for  that  matter).  Thus  even  if  an  infinite  number  of 
“bad”  long  steps  are  taken  (i.e.,  steps  that  decrease  f(x)  but  that  violate  (4.1)),  the  mere  fact  that  there 
are  infinitely  many  distinct  iterates  means  that  liminffc-yoo  Ak  =  0,  and  hence  hminffc-^oo  ||V/(.'rfc)||  =  0. 
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5.4.  Observations.  This  analysis  might  suggest  an  interpretation  of  pattern  search  as  a  search  over 
successively  finer  finite  grids.  If  the  finite  set  of  candidates  is  exhausted  without  finding  a  point  that  improves 
/(.?;),  then  the  grid  is  refined  by  reducing  Ak  and  the  process  is  repeated. 

However,  this  interpretation  is  misleading  insofar  as  it  suggests  that  pattern  search  algorithms  are 
exceedingly  inefficient.  In  practice,  pattern  search  algorithms  do  not  resort  to  searching  over  all  the  points  in 
increasingly  fine  grids  but  instead  behave  more  like  a  steepest  descent  method.  In  this  sense,  the  analysis  does 
not  reflect  the  actual  behavior  of  the  algorithm.  This  should  not  be  entirely  surprising  since,  unlike  gradient- 
based  methods,  the  specification  of  pattern  search  algorithms  does  not  obviously  contain  a  mechanism 
designed  to  guarantee  convergence. 

The  situation  is  analogous  to  that  of  the  simplex  method  in  linear  programming.  Once  one  establishes 
that  the  simplex  method  cannot  cycle,  the  convergence  of  the  algorithm  follows  from  the  fact  that  there  is 
only  a  finite  number  of  vertices  that  the  simplex  method  can  visit  in  its  search  for  a  solution.  This  means 
that  the  simplex  method  could  and  does  have  a  theoretical  worst-case  complexity  that  is  exponential,  but 
in  practice  the  simplex  method  has  proven  much  more  efficient  than  that. 

Moreover,  the  actual  behavior  of  pattern  search  in  any  single  iteration  can  be  very  different  than  the 
proof  of  convergence  might  be  thought  to  suggest.  The  search  can  accept  as  the  next  iterate  any  point 
in  Pk  that  satisfies  the  simple  decrease  condition  f(xk+ 1)  <  f(xk).  In  particular,  the  algorithm  does  not 
necessarily  need  to  examine  every  point  in  AkUk ;  it  need  only  do  so  before  deciding  to  reduce  A*,  which  is 
the  worst  case. 

In  the  best  case,  we  may  need  only  a  single  evaluation  of  f(x)  to  find  an  acceptable  step.  In  contrast,  in 
a  forward-difference  gradient- based  method  one  needs  at  least  n+ 1  evaluations  of  /  (,r)  (in  addition  to  /  (rr^)) 
to  find  a  new  iterate;  n  additional  values  of  /( x)  to  approximate  Vf(xk)  and  at  least  one  more  evaluation 
of  f(x)  to  decide  whether  or  not  to  accept  a  new  iterate. 

In  order  to  make  progress,  pattern  search  requires  the  eventual  reduction  of  Ak.  The  cost  of  discovering 
the  necessity  of  this  step  is  one  evaluation  of  f(x)  for  each  direction  defined  by  the  positive  basis  Uk.  For  a 
minimal  positive  basis  of  n+ 1  elements,  this  cost  is  the  same  as  the  cost  of  an  unsuccessful  quasi-Newton  step 
using  a  forward-difference  approximation  of  the  gradient;  n  evaluations  of  f(x)  to  form  the  finite-difference 
approximation  to  Vf(xk ),  and  the  evaluation  of  f(x)  at  the  rejected  x+ .  On  the  other  hand,  following  an 
unsuccessful  step  in  the  latter  algorithm,  one  gets  to  reuse  the  gradient  approximation;  it  is  not  clear  how 
best  to  reuse  information  from  unsuccessful  iterations  of  pattern  search  in  subsequent  iterations. 

5.5.  The  Resulting  Convergence  Results.  Let  Cl  =  {  x  |  f(x)  <  f(x o)  },  and  suppose  /  is  C 1  on 
a  neighborhood  of  Cl. 

Theorem  If  Cl  is  bounded,  then  the  iterates  produced  by  a  pattern  search  algorithm  satisfy 

hminf||V/(rr;fc)||=0. 

k— >oo 

If,  in  addition,  limfc_>0o  =  0  and  we  require  f(xk+ 1)  <  f(xk  +  Sk )  for  all  s&  €  A^IIfe,  the  steps  associated 
with  the  positive  basis,  and  the  columns  of  Dk  are  bounded  in  norm  uniformly  in  k,  then  we  have 

lim  ||V/(*fc)ll  =  0. 

k—KX> 

By  way  of  comparison,  we  obtain  the  result  lim/^oo  ||V/(.%‘A;)||  =  0  for  line-search  methods  without 
the  assumption  that  Cl  is  bounded  [12].  However,  we  must  also  require  sufficient  decrease  between  iterates 
according  to  (4.1)-(4.2),  rather  than  just  simple  decrease,  that  is,  f(xk+ 1)  <  f(xk). 
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For  trust-region  methods,  with  the  assumption  that  V/(.x)  is  uniformly  continuous  (but  again,  without 
the  assumption  that  Cl  is  bounded),  requiring  only  simple  decrease  f(xk+ 1)  <  f(z'k)  suffices  to  prove  that 
lim^oo  ||  V/(arjfe)||  =  0,  provided  the  approximation  of  the  Hessian  does  not  grow  too  rapidly  in  norm  [15]. 
With  a  sufficient  decrease  condition,  one  obtains  the  stronger  result  [11],  limfe— <*>  ||V/(«fc)||  =  0.  However, 
for  either  result  Vf(x)  is  used  in  both  the  fraction  of  Cauchy  decrease  condition  on  the  step  and  the  update 
of  the  trust  radius. 

Thus,  under  the  hypothesis  that  Cl  is  bounded,  the  global  convergence  results  for  pattern  search  algo¬ 
rithms  are  as  strong  as  those  for  gradient- based  methods.  This  might  seem  surprising,  but  it  simply  reflects 
just  how  little  one  needs  to  establish  global  convergence.  Pattern  search  is  sufficiently  like  steepest  descent 
that  it  works. 

This  leads  to  one  caveat  for  users:  like  steepest  descent,  pattern  search  methods  are  good  at  improving 
an  initial  guess  and  finding  a  neighborhood  of  a  local  solution,  but  fast  local  convergence  should  not  be 
expected.  In  general,  one  can  expect  only  a  linear  rate  of  local  convergence. 

6.  Concluding  Remarks.  We  have  tried  to  explain  how  and  why  pattern  search  works  while  refraining 
from  a  detailed  description  of  the  convergence  analysis.  Once  one  understands  the  essential  ideas,  the  proof 
of  global  convergence  is  reasonably  straightforward,  if  sometimes  tedious.  Precisely  because  pattern  search 
methods  have  so  little  analytical  information  explicitly  built  into  them,  it  takes  some  effort  to  extract  an 
assurance  that  they  actually  do  work.  However,  as  we  have  tried  to  indicate,  many  of  the  ideas  are  familiar 
from  standard  analysis  of  nonlinear  programming  algorithms.  The  novelty  lies  in  the  restriction  of  the 
iterates  to  a  lattice,  which  allows  us  to  relax  the  conditions  on  accepting  steps. 

The  ideas  discussed  here  also  appear  in  the  analysis  of  pattern  search  methods  for  constrained  mini¬ 
mization  [6,  9,  8].  For  readers  who  would  like  to  explore  the  connections  between  pattern  search  methods 
and  gradient-based  algorithms  in  greater  detail,  we  particularly  recommend  [10]. 
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